ADM - Introduction to Classification models
Outline: Introduction to Classification Models
- Introduction to Classification Models
- k-Nearest Neighbour
- Basic Evaluation
- Underfitting and Overfitting
- Cross-Validation
- Logistic Regression
- Naive Bayes
- Decision Tree and Random Forest
Target (dependent) and predictor (independent) variables
- Target variable: one or more variables that are influenced by one or more other variables. Example: an employee's salary is influenced by their length of service, rank, and position.
- Predictor variable: one or more variables that influence one or more other variables. Example: speed influences travel time.
- Control variable: a variable/element whose value is held fixed (constant), usually in an experiment, to test the relationship between the target and predictor variables. Example: the use of a placebo in a study of the effect of a particular drug.
- Confounding variable: also called a "third variable" or "mediator variable", an (extra) variable that influences the relationship between the dependent and independent variables. Example: in a study of the effect of exercise (predictor) on body weight (target), other variables such as diet and age will also have an influence.
The Importance of Domain Knowledge
The Data Structure of a Classification Problem
- Classification is the problem of assigning a group of new observations to one of a set of pre-existing categories (classes).
- Referring to the figure below, classification is used when the target variable is categorical and the predictors are one or more numeric and/or categorical variables.
Applications of Classification Models
Various Approaches to Classification
- There are quite a few classification models to choose from, ranging from classical ones such as Linear Discriminant Analysis (LDA) and logistic regression, through moderate ones such as SVM (support vector machines), decision trees, and neural networks, to more recent ones such as random forests and deep learning.
- Each has its own strengths and weaknesses, depending on how the model/algorithm works.
Inductive Bias as an Essential Foundation for Understanding ALL Data Science and Machine Learning Models
image source: https://sgfin.github.io/2020/06/22/Induction-Intro/
The Classification Problem
- Suppose we are given a problem with two categories, orange and purple, as in the figure.
- Each point in the figure is an entity in the data, consisting of several variables.
- Given a new point (in white), the classification problem is to assign this new data point to either the orange or the purple category.
Let's Discuss the Theory Alongside Its Implementation
In [1]:
# !pip install graphviz dtreeviz # if running on Google Colab
In [2]:
import warnings; warnings.simplefilter('ignore')
import pandas as pd, matplotlib.pyplot as plt
import time, numpy as np, seaborn as sns
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn import neighbors
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
#from dtreeviz.trees import *
#import graphviz
from sklearn import svm, preprocessing
from sklearn.gaussian_process.kernels import RBF
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score
sns.set(style="ticks", color_codes=True)
"Done"
Out[2]:
'Done'
Simple Classification Case 01: Classifying Iris Flower Species
- The Iris flower dataset as a simple case study
- Data link: https://archive.ics.uci.edu/ml/datasets/iris
- Source paper: Fisher, R.A. "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
- The classification task is to classify the Iris species based on the flower's shape (e.g. length and width).
In [3]:
# Load the Iris flower data
data = sns.load_dataset("iris")
print(data.shape)
data.sample(5)
(150, 5)
Out[3]:
| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 33 | 5.5 | 4.2 | 1.4 | 0.2 | setosa |
| 104 | 6.5 | 3.0 | 5.8 | 2.2 | virginica |
| 18 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
| 64 | 5.6 | 2.9 | 3.6 | 1.3 | versicolor |
In [4]:
data['species'] = data['species'].astype('category')
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    category
dtypes: category(1), float64(4)
memory usage: 5.1 KB
None
In [5]:
print("Duplikasi = ", data.duplicated().sum())
print(data.isnull().sum())
Duplikasi = 1 sepal_length 0 sepal_width 0 petal_length 0 petal_width 0 species 0 dtype: int64
In [6]:
data.drop_duplicates(keep="first", inplace=True)
print("Duplicates = ", data.duplicated().sum())
Duplicates =  0
In [7]:
p = sns.pairplot(data, hue="species")
In [8]:
# We create a new dataframe; be careful if the data is large.
df1 = data[['sepal_length','sepal_width','petal_length','petal_width']]
y1 = data['species']
df1.shape, y1.shape
Out[8]:
((149, 4), (149,))
Simple Classification Case 02: Building Energy Efficiency
- 12 different building shapes were simulated in Ecotect. The buildings differ in several parameters (e.g. glazing area, glazing area distribution, and orientation).
- These parameters yield 768 building configurations with 8 variables.
- The aim is to predict two real-valued responses. The data can also be used as a multi-class classification problem if the responses are rounded to the nearest integer.
- Data link: https://archive.ics.uci.edu/ml/datasets/energy+efficiency
- Source paper: A. Tsanas, A. Xifara: 'Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools', Energy and Buildings, Vol. 49, pp. 560-567, 2012
In [9]:
file_ = "data/building-energy-efficiency-ENB2012_data.csv"
try: # Running locally; make sure "file_" is in the "data" folder
data = pd.read_csv(file_)
except: # Running on Google Colab
!mkdir data
!wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/ptpjb/master/{file_}
data = pd.read_csv(file_)
print(data.shape)
data.sample(5)
(768, 12)
Out[9]:
| compactness | surface-area | wall-area | roof-area | overall-height | orientation | glazing-area | glazing-dist | heating-load | cooling-load | heat-cat | cool-cat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 220 | 0.71 | 710.5 | 269.5 | 220.5 | 3.5 | 2 | 0.10 | 4 | 10.66 | 13.67 | 10 | 13 |
| 545 | 0.79 | 637.0 | 343.0 | 147.0 | 7.0 | 3 | 0.40 | 1 | 42.62 | 39.07 | 42 | 39 |
| 568 | 0.64 | 784.0 | 343.0 | 220.5 | 3.5 | 2 | 0.40 | 1 | 19.52 | 22.72 | 19 | 22 |
| 359 | 0.76 | 661.5 | 416.5 | 122.5 | 7.0 | 5 | 0.25 | 2 | 36.45 | 36.81 | 36 | 36 |
| 421 | 0.66 | 759.5 | 318.5 | 220.5 | 3.5 | 3 | 0.25 | 3 | 13.01 | 15.80 | 13 | 15 |
Preprocessing & Minor EDA
- What preprocessing is needed?
In [10]:
print(data.info())
print(set(data["orientation"]))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 compactness 768 non-null float64
1 surface-area 768 non-null float64
2 wall-area 768 non-null float64
3 roof-area 768 non-null float64
4 overall-height 768 non-null float64
5 orientation 768 non-null int64
6 glazing-area 768 non-null float64
7 glazing-dist 768 non-null int64
8 heating-load 768 non-null float64
9 cooling-load 768 non-null float64
10 heat-cat 768 non-null int64
11 cool-cat 768 non-null int64
dtypes: float64(8), int64(4)
memory usage: 72.1 KB
None
{2, 3, 4, 5}
In [11]:
data['orientation'] = data['orientation'].astype('category')
data['heat-cat'] = data['heat-cat'].astype('category')
data['cool-cat'] = data['cool-cat'].astype('category')
data.describe(include="all")
Out[11]:
| compactness | surface-area | wall-area | roof-area | overall-height | orientation | glazing-area | glazing-dist | heating-load | cooling-load | heat-cat | cool-cat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.00000 | 768.0 | 768.000000 | 768.00000 | 768.000000 | 768.000000 | 768.0 | 768.0 |
| unique | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | 37.0 | 39.0 |
| top | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | 12.0 | 14.0 |
| freq | NaN | NaN | NaN | NaN | NaN | 192.0 | NaN | NaN | NaN | NaN | 84.0 | 82.0 |
| mean | 0.764167 | 671.708333 | 318.500000 | 176.604167 | 5.25000 | NaN | 0.234375 | 2.81250 | 22.307195 | 24.587760 | NaN | NaN |
| std | 0.105777 | 88.086116 | 43.626481 | 45.165950 | 1.75114 | NaN | 0.133221 | 1.55096 | 10.090204 | 9.513306 | NaN | NaN |
| min | 0.620000 | 514.500000 | 245.000000 | 110.250000 | 3.50000 | NaN | 0.000000 | 0.00000 | 6.010000 | 10.900000 | NaN | NaN |
| 25% | 0.682500 | 606.375000 | 294.000000 | 140.875000 | 3.50000 | NaN | 0.100000 | 1.75000 | 12.992500 | 15.620000 | NaN | NaN |
| 50% | 0.750000 | 673.750000 | 318.500000 | 183.750000 | 5.25000 | NaN | 0.250000 | 3.00000 | 18.950000 | 22.080000 | NaN | NaN |
| 75% | 0.830000 | 741.125000 | 343.000000 | 220.500000 | 7.00000 | NaN | 0.400000 | 4.00000 | 31.667500 | 33.132500 | NaN | NaN |
| max | 0.980000 | 808.500000 | 416.500000 | 220.500000 | 7.00000 | NaN | 0.400000 | 5.00000 | 43.100000 | 48.030000 | NaN | NaN |
In [12]:
print("Duplikasi = ", data.duplicated().sum())
print(data.isnull().sum())
Duplikasi = 0 compactness 0 surface-area 0 wall-area 0 roof-area 0 overall-height 0 orientation 0 glazing-area 0 glazing-dist 0 heating-load 0 cooling-load 0 heat-cat 0 cool-cat 0 dtype: int64
In [13]:
# Warning: somewhat slow because quite a few plots are generated
col_ = "surface-area wall-area roof-area overall-height heat-cat".split()
p = sns.pairplot(data[col_], hue="heat-cat")
In [14]:
# Challenge of the prediction
print(data["heat-cat"].value_counts())
p = sns.countplot(x="heat-cat", data=data)
12    84
14    67
32    56
11    45
15    45
10    43
28    43
29    38
36    31
24    31
16    27
13    26
35    20
17    17
39    16
40    16
26    16
25    16
19    13
33    12
31    12
6     12
23    12
18    10
41    10
38     9
42     8
27     5
22     5
37     5
7      4
34     4
8      4
30     2
20     2
21     1
43     1
Name: heat-cat, dtype: int64
In [15]:
data["orientation"].value_counts()
Out[15]:
2    192
3    192
4    192
5    192
Name: orientation, dtype: int64
In [16]:
# One-hot encoding, then merge with the original data
dum_ = pd.get_dummies(data['orientation'], prefix='ori')
data = pd.concat([data, dum_], axis = 1)
data.head()
Out[16]:
| compactness | surface-area | wall-area | roof-area | overall-height | orientation | glazing-area | glazing-dist | heating-load | cooling-load | heat-cat | cool-cat | ori_2 | ori_3 | ori_4 | ori_5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 2 | 0.0 | 0 | 15.55 | 21.33 | 15 | 21 | 1 | 0 | 0 | 0 |
| 1 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 3 | 0.0 | 0 | 15.55 | 21.33 | 15 | 21 | 0 | 1 | 0 | 0 |
| 2 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 4 | 0.0 | 0 | 15.55 | 21.33 | 15 | 21 | 0 | 0 | 1 | 0 |
| 3 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 5 | 0.0 | 0 | 15.55 | 21.33 | 15 | 21 | 0 | 0 | 0 | 1 |
| 4 | 0.90 | 563.5 | 318.5 | 122.50 | 7.0 | 2 | 0.0 | 0 | 20.84 | 28.28 | 20 | 28 | 1 | 0 | 0 | 0 |
In [17]:
df2A = data[['compactness', 'surface-area', 'wall-area', 'roof-area', \
'overall-height','orientation','glazing-area','glazing-dist']]
df2B = data[['compactness', 'surface-area', 'wall-area', 'roof-area', \
'overall-height','ori_2', 'ori_3', 'ori_4', 'ori_5','glazing-area','glazing-dist']]
y2 = data['heat-cat']
df2B.head()
Out[17]:
| compactness | surface-area | wall-area | roof-area | overall-height | ori_2 | ori_3 | ori_4 | ori_5 | glazing-area | glazing-dist | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 1 | 0 | 0 | 0 | 0.0 | 0 |
| 1 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0 | 1 | 0 | 0 | 0.0 | 0 |
| 2 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0 | 0 | 1 | 0 | 0.0 | 0 |
| 3 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0 | 0 | 0 | 1 | 0.0 | 0 |
| 4 | 0.90 | 563.5 | 318.5 | 122.50 | 7.0 | 1 | 0 | 0 | 0 | 0.0 | 0 |
Before We Start, We Split the Data into Train and Test Sets: Why?
- How should the train vs. test proportions be chosen?
- Be careful when separating the train and test data: why?
In [18]:
df1_train, df1_test, y1_train, y1_test = train_test_split(df1, y1, test_size=0.3, random_state=33)
df2A_train, df2A_test, y2_train, y2_test = train_test_split(df2A, y2, test_size=0.3, random_state=33) #No One-Hot
df2B_train, df2B_test, y2_train, y2_test = train_test_split(df2B, y2, test_size=0.3, random_state=33) # One-Hot
print(df1_train.shape, df1_test.shape)
print(df2A_train.shape, df2A_test.shape)
(104, 4) (45, 4)
(537, 8) (231, 8)
From Sample to Population: Underfitting and Overfitting
Parsimony: Simple Is Best
k-Nearest Neighbour
- The simplest classifier, but it can also be used for regression (and even clustering).
- Often referred to as an instance-based learner
- It has no "equation"; the approach is algorithmic, based on the concept of distance/similarity
- Similar in spirit to DBSCAN
k-NN Neighbour Size & Weights
- Uniform: all points in each neighborhood are weighted equally.
- Distance: closer neighbors of a query point have a greater influence than the neighbors further away.
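The weighting scheme matters most when classes overlap: with 'distance' weighting, far-away neighbours contribute little to the vote. A minimal sketch (reusing the Iris train/test split defined above) comparing the two schemes:
In [ ]:
# Illustrative sketch: compare 'uniform' vs 'distance' weighting on the same split
from sklearn import neighbors
from sklearn.metrics import accuracy_score
for w in ('uniform', 'distance'):
    knn_w = neighbors.KNeighborsClassifier(n_neighbors=5, weights=w)
    knn_w.fit(df1_train, y1_train)
    print(w, accuracy_score(y1_test, knn_w.predict(df1_test)))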
Similarity vs. Distance
Similarity explained in plain terms and its application in Python
http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
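To make the contrast concrete, a minimal sketch (with hypothetical vectors a and b): a distance grows as points move apart, while a similarity shrinks; one common conversion is similarity = 1/(1 + distance).
In [ ]:
# Illustrative sketch: one distance and one similarity for the same pair of vectors
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity
a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[2.0, 4.0, 6.0]])
d = euclidean_distances(a, b)[0, 0]
print("Euclidean distance :", d)
print("Cosine similarity  :", cosine_similarity(a, b)[0, 0])  # 1.0: same direction
print("1/(1+d) similarity :", 1.0 / (1.0 + d))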
Pros and Cons
- Pros:
- Relatively fast (efficient) for data that are not too large
- Simple and easy to implement
- Easy to modify: many distance/similarity formulas to choose from
- Handles multiclass data easily
- Accuracy is quite good when the data are representative
- Cons:
- Finding the nearest neighbours is inefficient for large data
- The data must be stored
- Ensuring the right distance formula is used
Application in Python
In [19]:
# k-NN: http://scikit-learn.org/stable/modules/neighbors.html
n_neighbors = 3
weights = 'distance'
kNN = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
kNN.fit(df1_train, y1_train)
print('Done!')
Done!
In [20]:
# Prediction with k-NN
y_kNN1 = kNN.predict(df1_test)
y_kNN1[-10:]
Out[20]:
array(['virginica', 'virginica', 'versicolor', 'versicolor', 'setosa',
'versicolor', 'versicolor', 'versicolor', 'setosa', 'setosa'],
dtype=object)
In [21]:
kNN = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
kNN.fit(df2B_train, y2_train)
y_kNN2 = kNN.predict(df2B_test)
y_kNN2[-10:]
Out[21]:
array([13, 25, 29, 29, 29, 17, 10, 39, 10, 11], dtype=int64)
How Good Are These Predictions? Evaluation Metrics
Confusion Matrix
- sensitivity, recall, hit rate, or true positive rate (TPR)
- precision or positive predictive value (PPV)
- $0\leq F\leq 1$, with 1 the optimal value
- $0\leq\beta<\infty$
- $\beta < 1$ lends more weight to precision,
- $\beta > 1$ favors recall
- $\beta \rightarrow 0$ considers only precision
- $\beta \rightarrow \infty$ considers only recall
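A worked sketch (with hypothetical labels; fbeta_score is scikit-learn's implementation of the F-beta score) showing how beta shifts the score between precision and recall:
In [ ]:
# Illustrative sketch: precision = 2/3, recall = 2/4 for these labels
from sklearn.metrics import precision_score, recall_score, fbeta_score
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
print("precision =", precision_score(y_true, y_pred))
print("recall    =", recall_score(y_true, y_pred))
for beta in (0.5, 1.0, 2.0):  # beta < 1 favors precision, beta > 1 favors recall
    print("F_%.1f = %.3f" % (beta, fbeta_score(y_true, y_pred, beta=beta)))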
In [22]:
print("Kasus 01 - Bunga Iris: kNN")
print(confusion_matrix(y1_test, y_kNN1))
print(classification_report(y1_test, y_kNN1))
Case 01 - Iris Flowers: kNN
[[11 0 0]
[ 0 15 0]
[ 0 2 17]]
precision recall f1-score support
setosa 1.00 1.00 1.00 11
versicolor 0.88 1.00 0.94 15
virginica 1.00 0.89 0.94 19
accuracy 0.96 45
macro avg 0.96 0.96 0.96 45
weighted avg 0.96 0.96 0.96 45
In [23]:
print("Kasus 02 - Building Energy")
print(confusion_matrix(y2_test, y_kNN2))
print(classification_report(y2_test, y_kNN2))
Case 02 - Building Energy
[[1 0 0 ... 0 0 0]
[0 1 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 1]
[0 0 0 ... 0 1 0]
[0 0 0 ... 1 0 0]]
precision recall f1-score support
6 1.00 0.33 0.50 3
7 1.00 0.50 0.67 2
8 0.00 0.00 0.00 1
10 0.13 0.15 0.14 13
11 0.15 0.14 0.15 14
12 0.06 0.08 0.07 25
13 0.06 0.12 0.08 8
14 0.00 0.00 0.00 22
15 0.22 0.17 0.19 12
16 0.18 0.25 0.21 8
17 0.00 0.00 0.00 2
18 0.00 0.00 0.00 1
19 0.00 0.00 0.00 5
20 0.00 0.00 0.00 1
21 0.00 0.00 0.00 1
22 0.00 0.00 0.00 1
23 0.00 0.00 0.00 6
24 0.07 0.12 0.09 8
25 0.00 0.00 0.00 2
26 0.00 0.00 0.00 6
27 0.00 0.00 0.00 1
28 0.08 0.06 0.07 16
29 0.00 0.00 0.00 11
30 0.00 0.00 0.00 1
31 0.00 0.00 0.00 3
32 0.06 0.07 0.07 14
33 0.00 0.00 0.00 5
34 0.00 0.00 0.00 2
35 0.25 0.14 0.18 7
36 0.00 0.00 0.00 10
37 0.00 0.00 0.00 3
38 1.00 0.17 0.29 6
39 0.00 0.00 0.00 3
40 0.00 0.00 0.00 4
41 0.33 0.33 0.33 3
42 0.00 0.00 0.00 1
accuracy 0.08 231
macro avg 0.13 0.07 0.08 231
weighted avg 0.11 0.08 0.09 231
In [24]:
# Cross-validation
# Note the variables: we are now using the entire dataset,
# but it is better to use only the train data (if the dataset is large enough)
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
kNN = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
mulai = time.time()
scores_kNN = cross_val_score(kNN, df1, y1, cv=10)
waktu = time.time() - mulai
# 95% confidence interval for the accuracy
print("Accuracy k-NN: %0.2f (+/- %0.2f), Time = %0.3f seconds" % (scores_kNN.mean(), scores_kNN.std() * 2, waktu))
Accuracy k-NN: 0.97 (+/- 0.09), Time = 0.026 seconds
In [25]:
# Visualization to evaluate the model more thoroughly
df_ = pd.DataFrame({'kNN': scores_kNN})
p = sns.boxplot(data = df_)
min(scores_kNN)
Out[25]:
0.8666666666666667
Classification with the Logistic Regression Model
- Find a straight line such that the prediction error is as small as possible (see the figure)
- Originally, logistic regression was a binary classification method: distinguishing between 2 classes or categories.
- Examples of binary classification problems: predicting whether a person has "cancer" or "no cancer", benign/malignant tumors, fraud or not fraud (in financial transactions), negative/positive in sentiment analysis, etc.
- Logistic regression is an extension of the linear regression model, converted into a classification method.
Logistic Regression
- http://www.saedsayad.com/logistic_regression.htm
- What is the meaning of the logarithm function?
- What are the consequences of the formula for $\beta$ above?
- Assumptions?
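As a small illustration of the model above, a sketch of the logistic (sigmoid) function that converts the linear predictor z into a probability in (0, 1); its inverse is the logit (log-odds) z = log(p/(1-p)):
In [ ]:
# Illustrative sketch: the sigmoid curve behind logistic regression
import numpy as np
import matplotlib.pyplot as plt
z = np.linspace(-6, 6, 200)
p = 1.0 / (1.0 + np.exp(-z))  # P(y=1|x) for z = b0 + b1*x1 + ... + bm*xm
plt.plot(z, p)
plt.axhline(0.5, ls='--', c='gray')  # the usual decision threshold
plt.xlabel('z (linear predictor)'); plt.ylabel('P(y=1|x)')
plt.show()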
The Relationship Between Logistic Regression and Neural Networks/Deep Learning
Pros and Cons of Logistic Regression
In [26]:
reglog = LogisticRegression().fit(df1_train, y1_train)
y_reglog1 = reglog.predict(df1_test)
print("Kasus 01 - Bunga Iris: Regresi Logistik")
print(confusion_matrix(y1_test, y_reglog1))
print(classification_report(y1_test, y_reglog1))
Case 01 - Iris Flowers: Logistic Regression
[[11 0 0]
[ 0 15 0]
[ 0 3 16]]
precision recall f1-score support
setosa 1.00 1.00 1.00 11
versicolor 0.83 1.00 0.91 15
virginica 1.00 0.84 0.91 19
accuracy 0.93 45
macro avg 0.94 0.95 0.94 45
weighted avg 0.94 0.93 0.93 45
In [27]:
mulai = time.time()
scores_regLog = cross_val_score(reglog, df1, y1, cv=10) # note that we now use the entire dataset
waktu = time.time() - mulai
print("Accuracy Regresi Logistik: %0.2f (+/- %0.2f), Waktu = %0.3f detik" % (scores_regLog.mean(), scores_regLog.std() * 2, waktu))
Accuracy Logistic Regression: 0.97 (+/- 0.07), Time = 0.164 seconds
In [28]:
# Visualization to evaluate & compare the models more thoroughly
df_ = pd.DataFrame({'kNN': scores_kNN, 'RegLog': scores_regLog})
p = sns.boxplot(data = df_)
df_.min()
Out[28]:
kNN       0.866667
RegLog    0.933333
dtype: float64
Naive Bayes Classifier
- P(x) is constant, so it can be ignored.
- Its strongest assumption is independence between the predictor variables (hence "naive")
- Classification is performed by computing the probability of each category given the data x = (x1, x2, ..., xm)
- NBC variants differ in how P(c|x) is computed, e.g. with a Gaussian (normal) distribution, often called Gaussian Naive Bayes (GNB):
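A minimal sketch of that computation (assuming the Iris training split from above; scipy's norm.pdf supplies the Gaussian density): each class gets one Gaussian per feature, and the per-feature densities are multiplied under the independence assumption.
In [ ]:
# Illustrative sketch: unnormalized GNB posteriors for one test point
import numpy as np
from scipy.stats import norm
x_new = df1_test.iloc[0].values
for c in y1_train.unique():
    Xc = df1_train[y1_train == c]
    # independence assumption: multiply the per-feature Gaussian densities
    likelihood = np.prod(norm.pdf(x_new, loc=Xc.mean().values, scale=Xc.std().values))
    prior = (y1_train == c).mean()
    print(c, "unnormalized posterior =", likelihood * prior)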
Pros and Cons of the Naive Bayes Classifier
Pros:
- Fast and easy to implement
- Well suited to multiclass problems
- If the (independence) assumption holds, performance is usually quite good and less training data is needed.
- Usually works well for categorical predictors; for numeric predictors NBC assumes a normal distribution (which is sometimes not satisfied)
Cons:
- If the test data contain a category that never appears in the training data, its probability is estimated as 0. This is often called the "zero frequency" problem.
- A very strong assumption (independence between predictors).
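A common fix for the zero-frequency problem is Laplace ("add-one") smoothing; a minimal sketch with hypothetical counts:
In [ ]:
# Illustrative sketch: P(x=c|y) = (count + alpha) / (N + alpha*K) is never exactly 0
counts = {"red": 3, "green": 0, "blue": 7}  # "green" never seen in training
K, N, alpha = len(counts), sum(counts.values()), 1
for cat, n in counts.items():
    print(cat, round(n / N, 3), "->", round((n + alpha) / (N + alpha * K), 3))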
In [29]:
# Naive Bayes: http://scikit-learn.org/stable/modules/naive_bayes.html
gnb = GaussianNB()
nbc = gnb.fit(df1_train, y1_train)
y_nb1 = nbc.predict(df1_test)
print(confusion_matrix(y1_test, y_nb1))
print(classification_report(y1_test, y_nb1))
[[11 0 0]
[ 0 15 0]
[ 0 2 17]]
precision recall f1-score support
setosa 1.00 1.00 1.00 11
versicolor 0.88 1.00 0.94 15
virginica 1.00 0.89 0.94 19
accuracy 0.96 45
macro avg 0.96 0.96 0.96 45
weighted avg 0.96 0.96 0.96 45
In [30]:
mulai = time.time()
scores_nb = cross_val_score(nbc, df1, y1, cv=10) # note that we now use the entire dataset
waktu = time.time() - mulai
print("Accuracy Naive Bayes: %0.2f (+/- %0.2f), Time = %0.3f seconds" % (scores_nb.mean(), scores_nb.std() * 2, waktu))
Accuracy Naive Bayes: 0.95 (+/- 0.09), Time = 0.024 seconds
In [31]:
# Visualization to evaluate & compare the models more thoroughly
df_ = pd.DataFrame({'kNN': scores_kNN, 'RegLog': scores_regLog, 'NaiveBys':scores_nb})
p = sns.boxplot(data = df_)
df_.min()
Out[31]:
kNN         0.866667
RegLog      0.933333
NaiveBys    0.866667
dtype: float64
Decision Tree: An Analogy
Decision Tree
Decision Tree: Example Applications
Decision Tree Theory: Entropy Formula
Decision Tree Theory: Entropy Calculation
Decision Tree Theory: Gain Formula
Decision Tree Theory: Gain Calculation
- Another example: http://www.saedsayad.com/decision_tree.htm
- Ross Quinlan's website: https://www.rulequest.com/Personal/
Decision Tree Theory: Information Theory
- Alternative to information gain: the Gini index (CART): https://medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-b93975f7a1f1
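A minimal sketch of the entropy and information-gain calculations above, for a hypothetical binary split:
In [ ]:
# Illustrative sketch: information gain = H(parent) - weighted H(children)
import numpy as np
def entropy(labels):
    _, cnt = np.unique(labels, return_counts=True)
    p = cnt / cnt.sum()
    return float(-(p * np.log2(p)).sum())
parent = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0])
left, right = np.array([1, 1, 1, 1]), np.array([1, 0, 0, 0, 0])
n = len(parent)
gain = entropy(parent) - (len(left)/n*entropy(left) + len(right)/n*entropy(right))
print("H(parent) = %.3f, gain = %.3f" % (entropy(parent), gain))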
Pengaruh "ketinggian" tree terhadap bentuk model ¶
Decision Tree (Pohon Keputusan): Kelebihan & Kekurangan ¶
When to use:
- Target: binomial/nominal.
- Predictors (input): binomial, nominal, and/or interval (ratio).
Advantages:
- Fast and embarrassingly parallel.
- No iterations; well suited to big-data technology (e.g. Hadoop) [map-reduce friendly]
- Interpretability
- Robust to outliers & missing values
Disadvantages:
- Non-probabilistic (ad hoc heuristic) +/-
- Targets with many classes
- Sensitive (instability)
In [32]:
# Decision Tree: http://scikit-learn.org/stable/modules/tree.html
DT = tree.DecisionTreeClassifier()
# Deliberately using the default parameters; (hyper)parameter optimization is discussed later
DT = DT.fit(df1_train, y1_train)
y_DT1 = DT.predict(df1_test)
print(confusion_matrix(y1_test, y_DT1))
print(classification_report(y1_test, y_DT1))
[[11 0 0]
[ 0 15 0]
[ 0 4 15]]
precision recall f1-score support
setosa 1.00 1.00 1.00 11
versicolor 0.79 1.00 0.88 15
virginica 1.00 0.79 0.88 19
accuracy 0.91 45
macro avg 0.93 0.93 0.92 45
weighted avg 0.93 0.91 0.91 45
In [33]:
# Variable importance - one of the strengths of decision trees
DT.feature_importances_
Out[33]:
array([0.01933984, 0.01450488, 0.06045018, 0.90570509])
In [40]:
clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(df1_train, y1_train)
p = tree.plot_tree(clf)
In [41]:
mulai = time.time()
scores_dt = cross_val_score(DT, df1, y1, cv=10) # note that we now use the entire dataset
waktu = time.time() - mulai
print("Accuracy Decision Tree: %0.2f (+/- %0.2f), Time = %0.3f seconds" % (scores_dt.mean(), scores_dt.std() * 2, waktu))
Accuracy Decision Tree: 0.95 (+/- 0.09), Time = 0.024 seconds
In [42]:
# Visualization to evaluate & compare the models more thoroughly
df_ = pd.DataFrame({'kNN': scores_kNN, 'RegLog': scores_regLog, 'NaiveBys':scores_nb, "DecTree":scores_dt})
p = sns.boxplot(data = df_)
df_.min()
Out[42]:
kNN         0.866667
RegLog      0.933333
NaiveBys    0.866667
DecTree     0.866667
dtype: float64
Curse of Dimensionality
Curse of Dimensionality & Random Forest
In [43]:
# Let's try to improve on this with a Random Forest
# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
rf = RandomForestClassifier()
rf.fit(df1_train, y1_train)
y_rf1 = rf.predict(df1_test)
print(confusion_matrix(y1_test, y_rf1))
print(classification_report(y1_test, y_rf1))
[[11 0 0]
[ 0 15 0]
[ 0 3 16]]
precision recall f1-score support
setosa 1.00 1.00 1.00 11
versicolor 0.83 1.00 0.91 15
virginica 1.00 0.84 0.91 19
accuracy 0.93 45
macro avg 0.94 0.95 0.94 45
weighted avg 0.94 0.93 0.93 45
In [44]:
# Variable importance
importances = rf.feature_importances_
# avoid shadowing the imported `tree` module with the loop variable
std = np.std([est.feature_importances_ for est in rf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
# Print the feature ranking
print("Feature ranking:")
for f in range(df1.shape[1]):
print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(df1.shape[1]), importances[indices],color="r", yerr=std[indices], align="center")
plt.xticks(range(df1.shape[1]), indices)
plt.xlim([-1, df1.shape[1]])
plt.show()
Feature ranking:
1. feature 3 (0.458596)
2. feature 2 (0.400527)
3. feature 0 (0.109768)
4. feature 1 (0.031108)
In [45]:
mulai = time.time()
scores_rf = cross_val_score(rf, df1, y1, cv=10) # note that we now use the entire dataset
waktu = time.time() - mulai
print("Accuracy Random Forest: %0.2f (+/- %0.2f), Time = %0.3f seconds" % (scores_rf.mean(), scores_rf.std() * 2, waktu))
Accuracy Random Forest: 0.96 (+/- 0.09), Time = 0.825 seconds
In [46]:
# Visualization to evaluate & compare the models more thoroughly
df_ = pd.DataFrame({'kNN': scores_kNN, 'RegLog': scores_regLog, 'NaiveBys':scores_nb, "DecTree":scores_dt, "Forest": scores_rf})
p = sns.boxplot(data = df_)
df_.min()
Out[46]:
kNN         0.866667
RegLog      0.933333
NaiveBys    0.866667
DecTree     0.866667
Forest      0.866667
dtype: float64
A More Complex Model Is Not Necessarily Better. Why?
In [47]:
# Save the results for use in the next module
import pickle
f = open('data/data_Module-11.pckl', 'wb')
pickle.dump((df_, df1, y1, df2A, df2B, y2), f)
f.close()
"Done"
Out[47]:
'Done'
Part 02: Advanced Classification Models
- Support Vector Machines
- Evaluation revisited: Underfitting & Overfitting
- Pipelining & Parameter Optimization
- Proper Model Selection
- Ensemble Learning
- Imbalanced Learning
- Case Study
In [48]:
# Loading Modules
import warnings; warnings.simplefilter('ignore')
import pickle
import pandas as pd, matplotlib.pyplot as plt
import time, numpy as np, seaborn as sns
from sklearn import svm, preprocessing
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn import neighbors
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import VotingClassifier
from sklearn import model_selection
from collections import Counter
from tqdm import tqdm
sns.set(style="ticks", color_codes=True)
print(pd.__version__)
"Done"
1.5.1
Out[48]:
'Done'
In [49]:
# Start by loading the data from the previous module
file_ = "data/data_Module-11.pckl"
try: # Running locally; make sure "file_" is in the "data" folder
f = open(file_, 'rb')
data = pickle.load(f); f.close()
except: # Running on Google Colab
!mkdir data
!wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/ptpjb/master/{file_}
f = open(file_, 'rb')
data = pickle.load(f); f.close()
df_, df1, y1, df2A, df2B, y2 = data
df_.shape, df_.keys()
Out[49]:
((10, 5), Index(['kNN', 'RegLog', 'NaiveBys', 'DecTree', 'Forest'], dtype='object'))
In [50]:
# This split matches the previous module because the SEED value is the same.
df1_train, df1_test, y1_train, y1_test = train_test_split(df1, y1, test_size=0.3, random_state=33)
df2A_train, df2A_test, y2_train, y2_test = train_test_split(df2A, y2, test_size=0.3, random_state=33) #No One-Hot
df2B_train, df2B_test, y2_train, y2_test = train_test_split(df2B, y2, test_size=0.3, random_state=33) # One-Hot
"Done"
Out[50]:
'Done'
Support Vector Machine (SVM)
Suppose the data are given as $\{(\bar{x}_1,y_1),...,(\bar{x}_n,y_n)\}$, where $\bar{x}_i$ is the input pattern for the $i$-th observation and $y_i$ is the desired target value. The categories (classes) are represented by $y_i\in\{-1,1\}$. A hyperplane separating these two classes (when they are "linearly separable") is: $$ \bar{w}'\bar{x} + b=0 $$ where $\bar{x}$ is the input vector (predictors), $\bar{w}$ the weights, and $b$ the bias.
Strengths of SVM Modeling
Support Vector Machine: Soft Margin
- Solved "easily" via linear/quadratic programming.
- The objective function is **convex**, so solving it yields a global optimum.
- Interpretation: the Recursive Feature Elimination (RFE) method looks at the square of each component of w (higher is better), as sketched below.
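A minimal sketch of the RFE idea (reusing df1 and y1 from Case 01; RFE here is scikit-learn's implementation): features are dropped one at a time based on the size of the components of w.
In [ ]:
# Illustrative sketch: keep the 2 features a linear SVM finds most useful
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
rfe = RFE(SVC(kernel='linear'), n_features_to_select=2)
rfe.fit(df1, y1)
print("kept    :", list(df1.columns[rfe.support_]))
print("ranking :", dict(zip(df1.columns, rfe.ranking_)))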
SVM Kernel (Trick): $R^m \rightarrow R^n, n\geq m$
Example Kernel Functions
- Let x = (x1, x2, x3) and y = (y1, y2, y3),
- with the feature mapping f(x) = (x1², x1x2, x1x3, x2x1, x2², x2x3, x3x1, x3x2, x3²).
- The corresponding kernel is K(x, y) = <f(x), f(y)> = <x, y>².
- Numeric example: let x = (1, 2, 3) and y = (4, 5, 6). Then:
- f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
- f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
- <f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024
- Complicated! Using the kernel function, the computation simplifies to:
- K(x, y) = (4 + 10 + 18)² = 32² = 1024
- This means a computation in the high-dimensional space can be carried out in the original (lower-dimensional) space via an inner product!
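A quick numeric check of the identity above (the mapping f is simply the flattened outer product):
In [ ]:
# Illustrative sketch: <f(x), f(y)> equals <x, y>^2
import numpy as np
x, y = np.array([1., 2., 3.]), np.array([4., 5., 6.])
fx = np.outer(x, x).ravel()  # (x1^2, x1x2, x1x3, x2x1, x2^2, ..., x3^2)
fy = np.outer(y, y).ravel()
print(fx @ fy, "==", (x @ y) ** 2)  # 1024.0 == 1024.0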
Popular Kernel Functions
Pros and Cons of SVM
Pros
- Good accuracy
- Works well on relatively small data samples
- Depends only on the support vectors ==> improves efficiency
- Convex ==> global minimum ==> guaranteed to converge
Cons
- Inefficient for large data
- Accuracy is sometimes low for multiclass classification (it is hard to capture the relationships between categories in the model)
- Not robust to noise
Further reading:
In [51]:
# Fit and evaluate the model
dSVM = svm.SVC(C = 10**5, kernel = 'linear') # e.g. using a linear kernel
dSVM.fit(df1_train, y1_train)
y_SVM1 = dSVM.predict(df1_test)
print(confusion_matrix(y1_test, y_SVM1))
print(classification_report(y1_test, y_SVM1))
[[11 0 0]
[ 0 15 0]
[ 0 0 19]]
precision recall f1-score support
setosa 1.00 1.00 1.00 11
versicolor 1.00 1.00 1.00 15
virginica 1.00 1.00 1.00 19
accuracy 1.00 45
macro avg 1.00 1.00 1.00 45
weighted avg 1.00 1.00 1.00 45
In [52]:
# The support vectors
print('Support vector indices: ', dSVM.support_)
print('Support vectors: \n', dSVM.support_vectors_)
Support vector indices:  [14 41 83  0 12 80 93 15 18 31 42 43 88]
Support vectors: 
 [[4.5 2.3 1.3 0.3]
 [5.1 3.8 1.9 0.4]
 [5.1 3.3 1.7 0.5]
 [5.1 2.5 3.  1.1]
 [5.9 3.2 4.8 1.8]
 [6.  2.7 5.1 1.6]
 [6.7 3.  5.  1.7]
 [6.3 2.8 5.1 1.5]
 [7.2 3.  5.8 1.6]
 [6.1 3.  4.9 1.8]
 [6.5 3.2 5.1 2. ]
 [6.3 2.7 4.9 1.8]
 [4.9 2.5 4.5 1.7]]
In [53]:
# Model weights for interpretation
print('w = ',dSVM.coef_)
print('b = ',dSVM.intercept_)
w =  [[-0.04630589  0.52106895 -1.00301941 -0.46411937]
 [ 0.04017805  0.17410509 -0.55713561 -0.2437469 ]
 [ 3.71728259  3.70419407 -7.34998017 -8.65277018]]
b =  [ 1.45332688  1.28948112 17.22405189]
In [54]:
# Using kernels: http://scikit-learn.org/stable/modules/svm.html#svm-kernels
for kernel in ('sigmoid', 'poly', 'rbf', 'linear'):
dSVM = svm.SVC(kernel=kernel)
dSVM.fit(df1_train, y1_train)
y_SVM = dSVM.predict(df1_test)
print(accuracy_score(y1_test, y_SVM))
0.24444444444444444
0.9777777777777777
0.9333333333333333
0.9555555555555556
In [55]:
dSVM = svm.SVC(C = 10**5, kernel = 'linear')
mulai = time.time()
scores_svm = cross_val_score(dSVM, df1, y1, cv=10) # note that we now use the entire dataset
waktu = time.time() - mulai
print("Accuracy SVM: %0.2f (+/- %0.2f), Time = %0.3f seconds" % (scores_svm.mean(), scores_svm.std() * 2, waktu))
Accuracy SVM: 0.98 (+/- 0.09), Time = 0.094 seconds
In [56]:
# Visualization to evaluate & compare the models more thoroughly
df_['SVM'] = scores_svm
p = sns.boxplot(data = df_)
df_.min()
Out[56]:
kNN         0.866667
RegLog      0.933333
NaiveBys    0.866667
DecTree     0.866667
Forest      0.866667
SVM         0.866667
dtype: float64
Inductive Bias
- Parameter-estimation bias (statistics)
- Sample inductive bias (Machine Learning - Tom Mitchell)
- Classifier-selection inductive bias (Statistical Learning Theory - Vapnik)
(Hyper)Parameter Optimization
- The comparison we just made, although cross-validated, is not yet fully valid.
- When comparing models, we must make sure that every model is given its optimal parameters.
In [57]:
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
file = 'data/diabetes_data.csv'
try:
# Local jupyter notebook, assuming "file" is in the "data" directory
data = pd.read_csv(file, names=names)
except:
# it's a google colab... create folder data and then download the file from github
!mkdir data
!wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/{file}
data = pd.read_csv(file, names=names)
print(data.shape, set(data['class']))
data.sample(5)
(768, 9) {0, 1}
Out[57]:
| preg | plas | pres | skin | test | mass | pedi | age | class | |
|---|---|---|---|---|---|---|---|---|---|
| 424 | 8 | 151 | 78 | 32 | 210 | 42.9 | 0.516 | 36 | 1 |
| 77 | 5 | 95 | 72 | 33 | 0 | 37.7 | 0.370 | 27 | 0 |
| 322 | 0 | 124 | 70 | 20 | 0 | 27.4 | 0.254 | 36 | 1 |
| 443 | 8 | 108 | 70 | 0 | 0 | 30.5 | 0.955 | 33 | 1 |
| 699 | 4 | 118 | 70 | 0 | 0 | 44.5 | 0.904 | 26 | 0 |
In [58]:
# Split Train-Test
X = data.values[:,:8] # Slice the data (note that the structure here is a NumPy array)
Y = data.values[:,8]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=99)
print(set(Y), x_train.shape, x_test.shape, sep=', ')
{0.0, 1.0}, (614, 8), (154, 8)
First, Run Everything with the Default Parameters
In [59]:
clf = LogisticRegression(solver='liblinear')
kNN = neighbors.KNeighborsClassifier()
gnb = GaussianNB()
dt = tree.DecisionTreeClassifier()
rf = RandomForestClassifier()
svm_ = svm.SVC()
Models = [('Logistic Regression', clf), ('k-NN',kNN), ('Naive Bayes',gnb), ('Decision Tree', dt), ('Random Forest', rf), ('SVM', svm_)]
Scores = {}
for model_name, model in tqdm(Models):
Scores[model_name] = cross_val_score(model, x_train, y_train, cv=10, scoring='accuracy')
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
dt = pd.DataFrame.from_dict(Scores)
ax = sns.boxplot(data=dt, ax=ax)
for m, s in Scores.items():
print(m, list(s)[:4])
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:01<00:00, 4.84it/s]
Logistic Regression [0.6290322580645161, 0.8225806451612904, 0.7741935483870968, 0.7096774193548387]
k-NN [0.6935483870967742, 0.6935483870967742, 0.7258064516129032, 0.5967741935483871]
Naive Bayes [0.7096774193548387, 0.8387096774193549, 0.6774193548387096, 0.7580645161290323]
Decision Tree [0.6935483870967742, 0.7419354838709677, 0.6935483870967742, 0.7419354838709677]
Random Forest [0.7096774193548387, 0.8064516129032258, 0.8064516129032258, 0.7741935483870968]
SVM [0.7096774193548387, 0.7903225806451613, 0.6935483870967742, 0.7258064516129032]
Hyperparameter Optimization
- As examples we take two algorithms (models) already discussed: k-NN and SVM
- As an exercise, try hyperparameter optimization on the other models.
- The parameters differ from model to model in ML, and their optimal values differ from case to case.
In [60]:
# Hyperparameter optimization for the kNN model using GridSearchCV
kCV = 10
metric = 'accuracy'
params = {}
params['kneighborsclassifier__n_neighbors'] = [1, 3, 5, 10, 15, 20, 25, 30]
params['kneighborsclassifier__weights'] = ('distance', 'uniform')
pipe = make_pipeline(neighbors.KNeighborsClassifier())
optKnn = GridSearchCV(pipe, params, cv=kCV, scoring=metric, verbose=1, n_jobs=-2) #
optKnn.fit(x_train, y_train)
print(optKnn.best_score_)
print(optKnn.best_params_)
Fitting 10 folds for each of 16 candidates, totalling 160 fits
0.7297726070861978
{'kneighborsclassifier__n_neighbors': 20, 'kneighborsclassifier__weights': 'uniform'}
In [61]:
# Example: hyperparameter optimization for the SVM model using RandomizedSearchCV
# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
# Here is an example of how to find out which parameters we can optimize.
# Use theoretical/analytical knowledge to optimize only the most important parameters.
pipeSVM = make_pipeline(svm.SVC())
print(sorted(pipeSVM.get_params().keys()))
['memory', 'steps', 'svc', 'svc__C', 'svc__break_ties', 'svc__cache_size', 'svc__class_weight', 'svc__coef0', 'svc__decision_function_shape', 'svc__degree', 'svc__gamma', 'svc__kernel', 'svc__max_iter', 'svc__probability', 'svc__random_state', 'svc__shrinking', 'svc__tol', 'svc__verbose', 'verbose']
In [62]:
# Optimal SVM parameters with RandomizedSearch
# WARNING: this cell takes quite a long time to run
kCV = 10
paramsSVM = {}
paramsSVM['svc__C'] = [1, 10, 100, 1000] #sp.stats.uniform(scale=100)
paramsSVM['svc__gamma'] = [0.1, 0.001, 0.0001, 1, 10]
paramsSVM['svc__kernel'] = ['rbf', 'sigmoid', 'linear'] # , 'poly'
optSvm = RandomizedSearchCV(pipeSVM, paramsSVM, cv=kCV, scoring=metric, verbose=2, n_jobs=-2) # refit=True; pre_dispatch='2*n_jobs' (at least 2*n_jobs)
optSvm.fit(x_train, y_train)
print(optSvm.best_score_)
print(optSvm.best_params_)
Fitting 10 folds for each of 10 candidates, totalling 100 fits
0.7736118455843469
{'svc__kernel': 'linear', 'svc__gamma': 1, 'svc__C': 1000}
Model Selection
In [63]:
kCV = 10
# Use the optimal parameters
kNN = neighbors.KNeighborsClassifier(n_neighbors= 20, weights= 'uniform')
svm_ = svm.SVC(kernel= 'linear', gamma= 10, C= 10)
# Run cross-validation
models = ['kNN', 'SVM']
knn_score = cross_val_score(kNN, x_test, y_test, cv=kCV, scoring='accuracy', n_jobs=-2, verbose=1)
svm_score = cross_val_score(svm_, x_test, y_test, cv=kCV, scoring='accuracy', n_jobs=-2, verbose=1)
scores = [knn_score, svm_score]
data = {m:s for m,s in zip(models, scores)}
for name in data.keys():
print("Accuracy %s: %0.2f (+/- %0.2f)" % (name, data[name].mean(), data[name].std() * 2))
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
p = sns.boxplot(data=pd.DataFrame(data), ax=ax)
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 15 concurrent workers.
[Parallel(n_jobs=-2)]: Done   3 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-2)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 15 concurrent workers.
[Parallel(n_jobs=-2)]: Done   3 out of  10 | elapsed:    1.2s remaining:    2.9s
Accuracy kNN: 0.71 (+/- 0.17)
Accuracy SVM: 0.78 (+/- 0.20)
[Parallel(n_jobs=-2)]: Done 10 out of 10 | elapsed: 7.0s finished
Ensemble Models
- What? Learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions.
- Why? Better predictions, a more stable model
- How? Bagging & Boosting
"Meta-algorithms": Bagging & Boosting
Boosting in ML
Properties of Boosting
In [64]:
# Example of voting (bagging) in Python
# Note: Random Forest is a bagging ensemble (albeit a modified one)
# Best practice: every model in the ensemble uses its optimal parameters
kNN = neighbors.KNeighborsClassifier(3)
kNN.fit(x_train, y_train)
Y_kNN = kNN.score(x_test, y_test)
DT = tree.DecisionTreeClassifier(random_state=1)
DT.fit(x_train, y_train)
Y_DT = DT.score(x_test, y_test)
model = VotingClassifier(estimators=[('k-NN', kNN), ('Decision Tree', DT)], voting='hard')
model.fit(x_train,y_train)
Y_Vot = model.score(x_test,y_test)
print('Accuracy k-NN', Y_kNN)
print('Accuracy Decision Tree', Y_DT)
print('Accuracy Voting', Y_Vot)
Accuracy k-NN 0.7142857142857143
Accuracy Decision Tree 0.6818181818181818
Accuracy Voting 0.7337662337662337
In [65]:
# Averaging can also be used for classification (not only regression),
# but then we use the probability of each category
T = tree.DecisionTreeClassifier()
K = neighbors.KNeighborsClassifier()
R = LogisticRegression()
T.fit(x_train,y_train)
K.fit(x_train,y_train)
R.fit(x_train,y_train)
y_T=T.predict_proba(x_test)
y_K=K.predict_proba(x_test)
y_R=R.predict_proba(x_test)
Ave = (y_T+y_K+y_R)/3
print(Ave[:5]) # Print just first 5
prediction = [v.index(max(v)) for v in Ave.tolist()]
print(prediction[:5]) # Print just first 5
print('Accuracy Averaging', accuracy_score(y_test, prediction))
[[0.86747806 0.13252194]
 [0.96569617 0.03430383]
 [0.90409318 0.09590682]
 [0.81735063 0.18264937]
 [0.97683156 0.02316844]]
[0, 0, 0, 0, 0]
Accuracy Averaging 0.7467532467532467
In [66]:
# AdaBoost
num_trees = 100
kfold = model_selection.KFold(n_splits=10)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=33)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
0.7421565276828435
Imbalanced Data
- The metric trap
- The accuracy of a particular category may matter more
- Example cases
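A minimal sketch of the metric trap (reusing the diabetes split above): a classifier that always predicts the majority class still reaches about 105/154 ≈ 0.68 accuracy on this test set, while its recall on the minority class is 0.
In [ ]:
# Illustrative sketch: a majority-class baseline looks fine on accuracy alone
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(x_train, y_train)
y_dummy = dummy.predict(x_test)
print("accuracy        =", round(accuracy_score(y_test, y_dummy), 3))
print("minority recall =", recall_score(y_test, y_dummy, pos_label=1))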
Imbalanced Learning
- Undersampling, oversampling, model-based (weight adjustment)
- https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
- Comparison plots: https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/combine/plot_comparison_combine.html#sphx-glr-auto-examples-combine-plot-comparison-combine-py
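A minimal sketch of random oversampling using only scikit-learn's resample (assuming the diabetes split above; packages such as imbalanced-learn provide richer methods like SMOTE):
In [ ]:
# Illustrative sketch: duplicate minority rows until the classes are balanced
import numpy as np
from collections import Counter
from sklearn.utils import resample
minority = x_train[y_train == 1]
n_extra = int((y_train == 0).sum() - (y_train == 1).sum())
extra = resample(minority, replace=True, n_samples=n_extra, random_state=33)
x_bal = np.vstack([x_train, extra])
y_bal = np.concatenate([y_train, np.ones(len(extra))])
print(Counter(y_bal))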
In [67]:
Counter(Y)
Out[67]:
Counter({1.0: 268, 0.0: 500})
In [68]:
# Fit a plain (unweighted) SVM as a baseline
svm_ = svm.SVC(kernel='linear')
svm_.fit(x_train, y_train)
y_SVMib = svm_.predict(x_test)
print(confusion_matrix(y_test, y_SVMib))
print(classification_report(y_test, y_SVMib))
[[93 12]
[19 30]]
precision recall f1-score support
0.0 0.83 0.89 0.86 105
1.0 0.71 0.61 0.66 49
accuracy 0.80 154
macro avg 0.77 0.75 0.76 154
weighted avg 0.79 0.80 0.79 154
In [69]:
# fit the model and get the separating hyperplane using weighted classes
# x_train, x_test, y_train, y_test
svm_balanced = svm.SVC(kernel='linear', class_weight={1: 3}) #WEIGHTED SVM
svm_balanced.fit(x_train, y_train)
y_SVMb = svm_balanced.predict(x_test)
print(confusion_matrix(y_test, y_SVMb))
print(classification_report(y_test, y_SVMb))
[[67 38]
[ 7 42]]
precision recall f1-score support
0.0 0.91 0.64 0.75 105
1.0 0.53 0.86 0.65 49
accuracy 0.71 154
macro avg 0.72 0.75 0.70 154
weighted avg 0.78 0.71 0.72 154
In [70]:
# Example of model-based imbalance treatment - SVM
from sklearn.datasets import make_blobs
n_samples_1, n_samples_2 = 1000, 100
centers = [[0.0, 0.0], [2.0, 2.0]]
clusters_std = [1.5, 0.5]
X, y = make_blobs(n_samples=[n_samples_1, n_samples_2],centers=centers,cluster_std=clusters_std,random_state=33, shuffle=False)
# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)
# fit the model and get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={1: 10}) #WEIGHTED SVM
wclf.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='k')# plot the samples
ax = plt.gca()# plot the decision functions for both classifiers
xlim = ax.get_xlim(); ylim = ax.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 30)# create grid to evaluate model
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)# get the separating hyperplane
a = ax.contour(XX, YY, Z, colors='k', levels=[0], alpha=0.5, linestyles=['-']) # plot decision boundary and margins
Z = wclf.decision_function(xy).reshape(XX.shape)# get the separating hyperplane for weighted classes
b = ax.contour(XX, YY, Z, colors='r', levels=[0], alpha=0.5, linestyles=['-'])# plot decision boundary and margins for weighted classes
plt.legend([a.collections[0], b.collections[0]], ["non weighted", "weighted"], loc="upper right")
plt.show()
Weighted Decision Tree
In [71]:
T = tree.DecisionTreeClassifier(random_state = 33)
T.fit(x_train,y_train)
y_DT = T.predict(x_test)
print('Accuracy (plain decision tree) = ', accuracy_score(y_test, y_DT))
print(classification_report(y_test, y_DT))
T = tree.DecisionTreeClassifier(class_weight = 'balanced', random_state = 33)
T.fit(x_train, y_train)
y_DT = T.predict(x_test)
print('Accuracy (weighted decision tree) = ', accuracy_score(y_test, y_DT))
print(classification_report(y_test, y_DT))
Accuracy (plain decision tree) =  0.6883116883116883
precision recall f1-score support
0.0 0.79 0.73 0.76 105
1.0 0.51 0.59 0.55 49
accuracy 0.69 154
macro avg 0.65 0.66 0.65 154
weighted avg 0.70 0.69 0.69 154
Accuracy (weighted decision tree) =  0.7207792207792207
precision recall f1-score support
0.0 0.83 0.74 0.78 105
1.0 0.55 0.67 0.61 49
accuracy 0.72 154
macro avg 0.69 0.71 0.69 154
weighted avg 0.74 0.72 0.73 154
Case Study (Exercise) ENB2012: Predicting Building Energy Use
Tasks
- Filter the Ecotect data and keep only the categories of the target variable (heat-cat) that appear at least 10 times (a starter sketch is given at the end of this section)
- Perform EDA (preprocessing and basic visualization)
- Determine the best model (with optimal parameters and cross-validation)
- Careful: Naive Bayes, Decision Tree, and Random Forest do not require one-hot encoding.
- Use the micro F1-score metric to determine the best model.
Optional
- Compare the best model above with an ensemble model.
- Is there an imbalance problem? Try to handle it with over/under-sampling.
In [72]:
file_ = "data/building-energy-efficiency-ENB2012_data.csv"
try: # Running locally; make sure "file_" is in the "data" folder
data = pd.read_csv(file_)
except: # Running on Google Colab
!mkdir data
!wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/ptpjb/master/{file_}
data = pd.read_csv(file_)
print(data.shape)
data.sample(5)
(768, 12)
Out[72]:
| compactness | surface-area | wall-area | roof-area | overall-height | orientation | glazing-area | glazing-dist | heating-load | cooling-load | heat-cat | cool-cat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 488 | 0.86 | 588.0 | 294.0 | 147.0 | 7.0 | 2 | 0.25 | 5 | 29.71 | 28.02 | 29 | 28 |
| 29 | 0.71 | 710.5 | 269.5 | 220.5 | 3.5 | 3 | 0.00 | 0 | 6.40 | 11.72 | 6 | 11 |
| 278 | 0.66 | 759.5 | 318.5 | 220.5 | 3.5 | 4 | 0.10 | 5 | 11.22 | 14.65 | 11 | 14 |
| 399 | 0.82 | 612.5 | 318.5 | 147.0 | 7.0 | 5 | 0.25 | 3 | 25.17 | 26.41 | 25 | 26 |
| 7 | 0.90 | 563.5 | 318.5 | 122.5 | 7.0 | 5 | 0.00 | 0 | 19.68 | 29.60 | 19 | 29 |
In [73]:
# The exercise answer starts in this cell
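A starter sketch for the first task (assuming `data` as loaded in the previous cell):
In [ ]:
# Illustrative sketch: keep only the heat-cat categories that occur at least 10 times
vc = data["heat-cat"].value_counts()
keep = vc[vc >= 10].index
data_f = data[data["heat-cat"].isin(keep)]
print(data_f.shape, data_f["heat-cat"].nunique())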